This notebook takes a first-pass look at the data provided: some high-level views of the data, the cleaning that will be necessary, and a first analysis of each field using the package pandas_profiling.
The first cell contains all the necessary package imports.
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from pandas_profiling import ProfileReport
%matplotlib inline
survey_data_file = "../data/survey_data.csv"
app_usage_file = "../data/survey_users_app_usage.csv"
df_survey = pd.read_csv(survey_data_file)
df_app = pd.read_csv(app_usage_file, parse_dates=['duolingo_start_date'])
df_survey.head()
| | user_id | age | annual_income | country | duolingo_platform | duolingo_subscriber | duolingo_usage | employment_status | future_contact | gender | other_resources | primary_language_commitment | primary_language_review | primary_language_motivation | primary_language_motivation_followup | primary_language_proficiency | student | survey_complete | time_spent_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35c9fc6e72c911e99681dca9049399ef | 18-34 | $26,000 - $75,000 | JP | Android phone or tablet | No, I have never paid for Duolingo Plus | Daily | Employed full-time | Yes | Male | Stories/novels/children's books,Movies/TV Shows | I'm very committed to learning this language. | I am using Duolingo to review a language I've ... | I like to learn new languages | I want to learn as many languages as I can,Oth... | Advanced | Not currently a student | 1 | 193 |
| 1 | 35c9fdde72c911e98630dca9049399ef | 18-34 | $26,000 - $75,000 | JP | iPhone or iPad | No, I have never paid for Duolingo Plus | Weekly | Employed full-time | Yes | Male | NaN | I'm slightly committed to learning this language. | I am using Duolingo to review a language I've ... | I need to be able to speak the local language ... | I am an immigrant,I am a refugee | Intermediate | Not currently a student | 1 | 139 |
| 2 | 35c9feb072c911e9ab4cdca9049399ef | 18-34 | $76,000 - $150,000 | JP | iPhone or iPad | Yes, I currently pay for Duolingo Plus | Daily | Employed full-time | Yes | Male | NaN | I'm moderately committed to learning this lang... | I am using Duolingo to review a language I've ... | I want to connect with my heritage or identity | NaN | Beginner | Not currently a student | 1 | 119 |
| 3 | 35c9ff7072c911e9900ddca9049399ef | 18-34 | $76,000 - $150,000 | JP | iPhone or iPad | No, but I have previously paid for Duolingo Plus | Daily | Employed full-time | Yes | Female | Other apps | I'm very committed to learning this language. | I am using Duolingo to learn this language for... | I am preparing for a trip | I want to learn some basics in the local langu... | Intermediate | Not currently a student | 1 | 229 |
| 4 | 35ca002672c911e99effdca9049399ef | 35 - 54 | $76,000 - $150,000 | JP | Android phone or tablet | Yes, I currently pay for Duolingo Plus | Daily | Employed full-time | Yes | Male | NaN | I'm very committed to learning this language. | I am using Duolingo to learn this language for... | I want to connect with my heritage or identity | NaN | Intermediate | Not currently a student | 1 | 157 |
df_survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6187 entries, 0 to 6186
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   user_id                               6187 non-null   object
 1   age                                   5838 non-null   object
 2   annual_income                         5182 non-null   object
 3   country                               6187 non-null   object
 4   duolingo_platform                     5911 non-null   object
 5   duolingo_subscriber                   5901 non-null   object
 6   duolingo_usage                        5911 non-null   object
 7   employment_status                     5730 non-null   object
 8   future_contact                        5446 non-null   object
 9   gender                                5838 non-null   object
 10  other_resources                       4474 non-null   object
 11  primary_language_commitment           6022 non-null   object
 12  primary_language_review               6014 non-null   object
 13  primary_language_motivation           5948 non-null   object
 14  primary_language_motivation_followup  3715 non-null   object
 15  primary_language_proficiency          6027 non-null   object
 16  student                               5523 non-null   object
 17  survey_complete                       6187 non-null   int64
 18  time_spent_seconds                    6187 non-null   int64
dtypes: int64(2), object(17)
memory usage: 918.5+ KB
print(len(df_survey['user_id'].unique()))
6150
There are 6187 rows but only 6150 unique user_id values, so 37 rows are duplicates. Let's take a quick look at the duplicate users.
df_survey[df_survey.duplicated(subset='user_id',keep=False)].sort_values(by='user_id')
| | user_id | age | annual_income | country | duolingo_platform | duolingo_subscriber | duolingo_usage | employment_status | future_contact | gender | other_resources | primary_language_commitment | primary_language_review | primary_language_motivation | primary_language_motivation_followup | primary_language_proficiency | student | survey_complete | time_spent_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 35ca11b872c911e984abdca9049399ef | 18-34 | $11,000 - $25,000 | JP | iPhone or iPad | No, I have never paid for Duolingo Plus | Daily | Employed part-time | Yes | Female | Speak with others (language events, conversati... | I'm extremely committed to learning this langu... | I am using Duolingo to learn this language for... | I want to challenge myself | NaN | Beginner | Full-time student | 1 | 169 |
| 447 | 35ca11b872c911e984abdca9049399ef | 18-34 | $0 - $10,000 | JP | iPhone or iPad | No, I have never paid for Duolingo Plus | Daily | Employed part-time | Yes | Female | Online language class | I'm extremely committed to learning this langu... | I am using Duolingo to learn this language for... | I want to challenge myself | NaN | Beginner | Part-time student | 1 | 346 |
| 6064 | 35ca57a372c911e98c79dca9049399ef | 35 - 54 | NaN | DE | Android phone or tablet | No, I have never paid for Duolingo Plus | Daily | Employed full-time | Yes | Male | Other apps | I'm very committed to learning this language. | I am using Duolingo to learn this language for... | I need to be able to speak the local language ... | I am studying abroad | Beginner | Not currently a student | 1 | 243 |
| 6030 | 35ca57a372c911e98c79dca9049399ef | 18-34 | $11,000 - $25,000 | JP | iPhone or iPad | No, I have never paid for Duolingo Plus | Daily | Unemployed | Yes | Male | Textbooks,Stories/novels/children's books,Movi... | I'm extremely committed to learning this langu... | I am using Duolingo to learn this language for... | I want my family to learn a language together | NaN | Intermediate | Not currently a student | 1 | 311 |
| 6035 | 35ca6ffa72c911e993bcdca9049399ef | 18-34 | $26,000 - $75,000 | JP | iPhone or iPad | Yes, I currently pay for Duolingo Plus | Daily | Employed full-time | Yes | Female | Movies/TV Shows | I'm extremely committed to learning this langu... | I am using Duolingo to review a language I've ... | I am preparing for a trip | I want to communicate with locals in a meaning... | Intermediate | Not currently a student | 1 | 150 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6036 | 35d6a26672c911e99139dca9049399ef | NaN | NaN | US | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 13 |
| 6092 | 35d72ec072c911e9888ddca9049399ef | 18-34 | $0 - $10,000 | US | iPhone or iPad | No, I have never paid for Duolingo Plus | Weekly | Employed part-time | Yes | Female | In-person language class,Online language class... | I'm very committed to learning this language. | I am using Duolingo to review a language I've ... | I need to learn this language for school | I am required to use Duolingo in a class I am ... | Beginner | Full-time student | 1 | 497 |
| 6057 | 35d72ec072c911e9888ddca9049399ef | 18-34 | $11,000 - $25,000 | BR | iPhone or iPad | No, I have never paid for Duolingo Plus | Weekly | Employed part-time | Yes | Female | NaN | I'm very committed to learning this language. | I am using Duolingo to learn this language for... | I want to use my time more productively | I want to spend less time on social media,Othe... | Intermediate | Not currently a student | 1 | 328 |
| 6082 | 35d73b1972c911e9b6ffdca9049399ef | NaN | NaN | DE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 24 |
| 6047 | 35d73b1972c911e9b6ffdca9049399ef | NaN | NaN | DE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 36 |
74 rows × 19 columns
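There is no obvious pattern in the duplicate records as far as I can tell. In some cases, the same user_id shows up with very different responses, including different countries, ages, income levels, etc. In others, both records are incomplete surveys with no answers at all (for example, rows 6047 and 6082).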
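One way to make this observation more concrete is to count, for each duplicated user_id, how many columns hold conflicting values. The sketch below runs on a small made-up frame; the `toy` data and the `conflicting_columns` helper are illustrative assumptions rather than part of this notebook, and the same call could be pointed at `df_survey` with `key="user_id"`.

```python
import pandas as pd


def conflicting_columns(df: pd.DataFrame, key: str) -> pd.Series:
    """For each key that occurs more than once, count the columns whose values disagree."""
    # Keep every row whose key is repeated (both the first and later occurrences).
    dupes = df[df.duplicated(subset=key, keep=False)]
    # dropna=False treats NaN as a value, so NaN-vs-answer counts as a conflict.
    per_key = dupes.groupby(key).nunique(dropna=False)
    # A column conflicts for a key when it holds more than one distinct value.
    return (per_key > 1).sum(axis=1)


# Hypothetical toy data: "a" is an exact duplicate, "c" disagrees on both fields.
toy = pd.DataFrame({
    "user_id": ["a", "a", "b", "c", "c"],
    "country": ["JP", "JP", "US", "DE", "JP"],
    "age": ["18-34", "18-34", "35 - 54", "35 - 54", "18-34"],
})
conflicts = conflicting_columns(toy, "user_id")
print(conflicts)
```

A count of zero would indicate an exact repeat of the same response, while larger counts point to records that probably cannot be reconciled automatically.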
Using the package pandas_profiling, I can get a quick and dirty look at the underlying data, including histograms of the distributions of each variable, correlations, data types, etc.
profile = ProfileReport(df_survey, title="Survey Data Profiling Report", explorative=True)
profile.to_widgets()
profile
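If the report widgets do not render (for example in a static export of the notebook), a few plain pandas calls give a rough subset of the same information. This is only a minimal sketch: `quick_summary` is a hypothetical helper and `toy` is made-up data, and it covers data types, missing values, and distinct counts but not the histograms or correlations from the full report.

```python
import pandas as pd


def quick_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, share of missing values, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),  # fraction of rows that are NaN/None
        "n_unique": df.nunique(),          # distinct non-null values
    })


# Hypothetical toy data shaped like two of the survey columns.
toy = pd.DataFrame({
    "age": ["18-34", None, "35 - 54", "18-34"],
    "time_spent_seconds": [193, 139, 119, 229],
})
summary = quick_summary(toy)
print(summary)
```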
Now that we've examined the survey data, let's look at the app usage data file.
df_app.head()
| | user_id | duolingo_start_date | daily_goal | highest_course_progress | took_placement_test | purchased_subscription | highest_crown_count | n_active_days | n_lessons_started | n_lessons_completed | longest_streak | n_days_on_platform |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35cb7e8f72c911e9888edca9049399ef | 2018-06-20 21:14:00 | NaN | 46.0 | True | False | 277.0 | 88 | 741.0 | 668.0 | 135 | 137 |
| 1 | 35ca34fd72c911e99ed6dca9049399ef | 2017-08-08 05:01:00 | NaN | 50.0 | True | False | 62.0 | 16 | 57.0 | 57.0 | 6 | 453 |
| 2 | 35d1a54a72c911e98e25dca9049399ef | 2014-10-15 17:55:00 | 1.0 | 71.0 | False | False | 202.0 | 29 | 315.0 | 295.0 | 55 | 1481 |
| 3 | 35d4beb072c911e9aa92dca9049399ef | 2018-10-05 09:28:00 | NaN | 2.0 | False | False | 2.0 | 3 | 6.0 | 5.0 | 1 | 30 |
| 4 | 35ccf4bd72c911e9be2edca9049399ef | 2015-09-17 03:16:00 | NaN | 34.0 | False | False | 216.0 | 57 | 338.0 | 297.0 | 56 | 1144 |
df_app.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6149 entries, 0 to 6148
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   user_id                  6149 non-null   object
 1   duolingo_start_date      6149 non-null   datetime64[ns]
 2   daily_goal               2687 non-null   float64
 3   highest_course_progress  6135 non-null   float64
 4   took_placement_test      6135 non-null   object
 5   purchased_subscription   6149 non-null   bool
 6   highest_crown_count      5857 non-null   float64
 7   n_active_days            6149 non-null   int64
 8   n_lessons_started        5993 non-null   float64
 9   n_lessons_completed      5993 non-null   float64
 10  longest_streak           6149 non-null   int64
 11  n_days_on_platform       6149 non-null   int64
dtypes: bool(1), datetime64[ns](1), float64(5), int64(3), object(2)
memory usage: 534.6+ KB
print(len(df_app['user_id'].unique()))
6114
With 6149 rows and 6114 unique user_id values, there are 35 duplicate rows in this file as well.
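The duplicate count here is the number of rows minus the number of unique ids (6149 - 6114 = 35). That figure counts surplus rows, which is not necessarily the number of ids that repeat: the two differ whenever an id appears three or more times. The sketch below reports both on made-up data; `duplicate_summary` and `toy` are hypothetical names used for illustration, and the helper could be applied to `df_app` or `df_survey` with `key="user_id"`.

```python
import pandas as pd


def duplicate_summary(df: pd.DataFrame, key: str) -> dict:
    """Row count, unique keys, surplus rows, and keys that occur more than once."""
    counts = df[key].value_counts()
    return {
        "rows": len(df),
        "unique_keys": int(counts.size),
        # Rows beyond the first occurrence of each key (rows - unique_keys).
        "surplus_rows": int(df.duplicated(subset=key).sum()),
        # Distinct keys that appear at least twice.
        "keys_repeated": int((counts > 1).sum()),
    }


# Hypothetical toy data: "a" appears twice and "c" three times.
toy = pd.DataFrame({"user_id": ["a", "a", "b", "c", "c", "c"]})
summary = duplicate_summary(toy, "user_id")
print(summary)
```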
Let's now generate a profile of the app usage data.
profile_usage = ProfileReport(df_app, title="App Usage Data Profiling Report", explorative=True)
profile_usage.to_widgets()
profile_usage